IN THE CLAIMS

Please cancel claims 2, 4, 17, 18, 21, and 28 without prejudice. 

Please amend claims 1, 3, 5, 14, 19, 20, and 24 as follows: 

1. (Currently Amended) A method for processing audio data, comprising:
using discriminatively-trained classifiers that are time-delay neural network
(TDNN) classifiers to produce a plurality of anchor model outputs;

applying [[a]] the plurality of anchor models to the audio data;

mapping the output of the plurality of anchor models into frame tags; and 

producing the frame tags[[;]].

[[wherein the plurality of anchor models comprise a discriminatively-trained]]

2. (Canceled) 

3. (Currently Amended) The method as set forth in claim [[2]] 1, further
comprising training the [[convolutional]] TDNN classifier on data separate from audio data
available in a use phase.

4. (Canceled) 

5. (Currently Amended) The method as set forth in [[claim 4]] claim 1, further
comprising training the TDNN classifier using cross entropy.

6. (Original) The method as set forth in claim 1, further comprising pre-
processing the audio data to generate input feature vectors for the discriminatively-trained
classifier. 

7. (Original) The method as set forth in claim 1, further comprising normalizing
a feature vector output of the discriminatively-trained classifier.

8. (Original) The method as set forth in claim 7, wherein the normalized feature
vectors are vectors of unit length. 

9. (Original) The method as set forth in claim 1, further comprising:
accepting a plurality of input feature vectors corresponding to audio features
contained in the audio data; and

applying the discriminatively-trained classifier to the plurality of input feature 
vectors to produce a plurality of anchor model outputs. 

10. (Original) The method as set forth in claim 1, wherein the mapping
comprises: 

clustering anchor model outputs from the discriminatively-trained classifier 
into separate clusters using a clustering technique; and 

associating a frame tag to each separate cluster. 

11. (Original) The method as set forth in claim 10, further comprising applying
temporal sequential smoothing to the frame tag using temporal information associated with 
the anchor model outputs. 

12. (Original) The method as set forth in claim 1, further comprising:
training the discriminatively-trained classifier using a speaker training set
containing a plurality of known speakers; and

pre-processing the speaker training set and the audio data in the same 
manner to provide a consistent input to the discriminatively-trained classifier. 

13. (Original) A computer-readable medium having computer-executable
instructions for performing the method recited in claim 1.

14. (Currently Amended) A computer-implemented process for processing audio 
data, comprising: 
applying a plurality of anchor models to the audio data, the plurality of
anchor models comprising discriminatively-trained classifiers of a convolutional neural
network that were previously trained using a training technique that included non-linear
terms;

obtaining a preliminary output of the plurality of anchor models from the 
convolutional neural network before final non-linear terms are applied to generate a 
modified feature vector output; 

normalizing the modified feature vector output to generate normalized 
anchor model output; 

mapping the [[output of the]] normalized anchor [[models]] model output into
frame tags; and

producing the frame tags[[;]].

[[wherein the plurality of anchor models comprise a discriminatively-trained
classifier that is previously trained using a training technique.]]

15. (Original) The computer-implemented process of claim 14, wherein the
training technique employs a cross-entropy cost function. 

16. (Original) The computer-implemented process of claim 14, wherein the 
training technique employs a mean-square error metric. 

17. (Canceled) 

18. (Canceled) 

19. (Currently Amended) The computer-implemented process of claim [[18]] 14,
wherein normalizing further comprises creating a modified feature vector output having 
unit length. 

20. (Currently Amended) A method for processing audio data containing a 
plurality of speakers, comprising: 
using discriminatively-trained classifiers that are time-delay neural network
(TDNN) classifiers to produce a plurality of anchor model outputs; 

applying [[a]] the plurality of anchor models to the audio data;

mapping an output of the anchor models into frame tags; and 

constructing a list of start and stop times for each of the plurality of speakers 
based on the frame tags; 

wherein [[the plurality of anchor models comprise a]] discriminatively-trained
[[classifier]] classifiers were previously trained using a training set containing a set of training
speakers, and wherein the plurality of speakers is not in the set of training speakers. 

21. (Canceled) 

22. (Original) The method as set forth in claim 20, further comprising normalizing 
a feature vector output from the convolutional neural network classifier by mapping each 
element of the feature vector output to a unit sphere such that the feature vector output 
has unit length. 

23. (Original) One or more computer-readable media having computer-readable 
instructions thereon which, when executed by one or more processors, cause the one or 
more processors to implement the method of claim 20. 

24. (Currently Amended) A computer-readable medium having computer- 
executable instructions for processing audio data, comprising: 

training a discriminatively-trained classifier that is a time-delay neural 
network (TDNN) classifier in a discriminative manner on a convolutional neural network 
using a training technique that includes non-linear terms such that the training occurs 
during a training phase to generate parameters that can be used at a later time by the 
[[discriminatively-trained]] TDNN classifier;

using discriminatively-trained classifiers that are time-delay neural network
(TDNN) classifiers to produce a plurality of anchor model outputs;

obtaining the plurality of anchor model outputs from the convolutional neural
network before final non-linear terms are applied to generate a modified plurality of anchor 
model outputs; 

normalizing the modified plurality of anchor model output to generate 
normalized anchor model outputs; 

[[applying the discriminatively trained classifier that uses the parameters to
the audio data to generate anchor model outputs; and]]

clustering the normalized anchor model outputs into frame tags of speakers 
that are contained in the audio data. 

25. (Original) The computer-readable medium of claim 24, further comprising 
pre-processing a speaker training set during the training and validation phase to produce a 
first set of input feature vectors for the discriminatively-trained classifier. 

26. (Original) The computer-readable medium of claim 25, further comprising 
pre-processing the audio data during the use phase to produce a second set of input 
feature vectors for the discriminatively-trained classifier, the pre-processing of the audio 
data being performed in the same manner as the pre-processing of the speaker training
set. 

27. (Original) The computer-readable medium of claim 24, further comprising 
normalizing the feature vector outputs to produce feature vectors having a unit length. 

28. (Canceled) 

29. (Original) The computer-readable medium of claim 25, further comprising 
applying temporal sequential smoothing to the clustering the clustered feature vector 
outputs to produce the frame tags. 

Claims 30-60: Canceled 